Abstract

Genomic distance between two genomes, i.e., the smallest number of genome
rearrangements required to transform one genome into the other, is often used as a
measure of evolutionary closeness of the genomes in comparative genomics studies.
However, in models that include rearrangements of significantly different "power"
such as reversals (that are "weak" and most frequent rearrangements) and
transpositions (that are more "powerful" but rare), the genomic distance typically corresponds
to a transformation with a large proportion of transpositions, which is not biologically
adequate.

Weighted genomic distance is a traditional approach to bounding the proportion of
transpositions by assigning them a relative weight $\alpha > 1$. A number of previous
studies addressed the problem of computing weighted genomic distance with $\alpha \le 2$.

Employing the model of multi-break rearrangements on circular genomes, which
captures both reversals (modelled as 2-breaks) and transpositions (modelled as
3-breaks), we prove that for $\alpha \in (1, 2]$, a minimum-weight transformation may entirely
consist of transpositions, implying that the corresponding weighted genomic distance
does not actually achieve its purpose of bounding the proportion of transpositions. We
further prove that for $\alpha \in (1, 2)$, the minimum-weight transformations do not depend
on a particular choice of $\alpha$ from this interval. We give a complete characterization of
such transformations and show that they coincide with the transformations that at the 
same time have the shortest length and make the smallest number of breakages in the 
genomes. 

Our results also provide a theoretical foundation for the empirical observation
that for $\alpha < 2$, transpositions are favored over reversals in the minimum-weight
transformations.

1 Introduction 

Genome rearrangements are evolutionary events that change genomic architectures. The most
frequent rearrangements are reversals (also called inversions) that "flip" continuous
segments within single chromosomes. Other common types of rearrangements are
translocations that "exchange" segments from different chromosomes and fissions/fusions that
respectively "cut"/"glue" chromosomes.

Since large-scale rearrangements happen rarely and have a dramatic effect on the
genomes, the number of rearrangements (genomic distance¹) between two genomes
represents a good measure of their evolutionary remoteness and is often used as such in
phylogenomic studies. Depending on the model of rearrangements, there exist different
types of genomic distance [10].

Particularly famous examples are the reversal distance between unichromosomal
genomes [12] and the genomic distance between multichromosomal genomes under all
aforementioned types of rearrangements [11]. Although both of these distances can be
computed in polynomial time, their analysis is somewhat complicated, which limits their
applicability in complex setups. The situation becomes even worse when the chosen
model includes more "complex" rearrangement operations such as transpositions that cut
off a segment of a chromosome and insert it into some other place in the genome. The
computational complexity of most distances involving transpositions, including the
transposition distance, remains unknown [13, 4, 8]. To overcome difficulties associated with the
analysis of genomic distances, many researchers now use simpler models of multi-break [3],
DCJ [14], and block-interchange [7] rearrangements, as well as circular instead of linear
genomes, which give a reasonable approximation to the original genomic distances [1].

Another obstacle in genomic distance-based approaches arises from the fact that
transposition-like rearrangements are at the same time much rarer and more "powerful" than
reversal-like rearrangements. As a result, in models that include both reversals and
transpositions, the genomic distance typically corresponds to rearrangement scenarios with a
large proportion of transpositions, which is not biologically adequate. A traditional
approach to bounding the proportion of transpositions is weighted genomic distance, defined
as the minimum weight of a transformation between two genomes, where transpositions
are assigned a relative weight $\alpha > 1$ [10]. A number of previous studies addressed the
weighted genomic distance for $\alpha \le 2$. In particular, Bader and Ohlebusch [4]
developed a 1.5-approximation algorithm for $\alpha \in [1, 2]$. For $\alpha = 2$, Eriksen [9] proposed a
$(1 + \varepsilon)$-approximation algorithm (for any $\varepsilon > 0$).

Employing the model of multi-break rearrangements [3] on circular genomes, which
captures both reversals (modelled as 2-breaks) and transpositions (modelled as 3-breaks),
we prove that for $\alpha \in (1, 2]$, a minimum-weight transformation may entirely consist of
transpositions. Therefore, the corresponding weighted genomic distance does not actually
achieve its purpose of bounding the proportion of transpositions. We further prove that
for $\alpha \in (1, 2)$, the minimum-weight transformations do not depend on a particular choice
of $\alpha$ from this interval (thus are the same, say, for $\alpha = 1.001$ and $\alpha = 1.999$), and give
a complete characterization of such transformations. In particular, we show that these
transformations coincide with those that at the same time have the shortest length and
make the smallest number of breakages in the genomes, first introduced by Alekseyev
and Pevzner [2].

¹ We remark that the term genomic distance is sometimes used to refer to a particular distance under
reversals, translocations, fissions, and fusions.




Figure 1: a) Graph representation of a two-chromosomal genome P = (+a -b)(+c +e +d) as two black-obverse
cycles and a unichromosomal genome Q = (+a +b -e +c -d) as a gray-obverse cycle. b) The superposition
of the genomes P and Q. c) The breakpoint graph G(P, Q) of the genomes P and Q (with obverse
edges removed).

Our results also provide a theoretical foundation for the empirical observation of
Blanchette et al. [6] that for $\alpha < 2$, transpositions are favored over reversals in the
minimum-weight transformations.

2 Multi-break Rearrangements and Breakpoint Graphs 

We represent a circular chromosome on $n$ genes $x_1, x_2, \dots, x_n$ as a cycle graph on $2n$ edges
alternating between directed "obverse" edges, encoding genes and their directionality,
and undirected "black" edges, connecting adjacent genes (Fig. 1a). A genome consisting
of $m$ chromosomes is then represented as $m$ such cycles. The edges of each color form a
perfect matching.

A k-break rearrangement [3] is defined as the replacement of a set of k black edges in a
genome with a different set of k black edges forming a matching on the same 2k vertices.
In the current study we consider only 2-break (representing reversals, translocations, 
fissions, fusions) and 3-break rearrangements (including transpositions). 

For two genomes P and Q on the same set of genes,² represented as black-obverse
cycles and gray-obverse cycles respectively, their superposition is called the breakpoint
graph G(P, Q) [5]. Hence, G(P, Q) consists of edges of three colors (Fig. 1b): directed
"obverse" edges representing genes, undirected black edges representing adjacencies in
the genome P, and undirected gray edges representing adjacencies in the genome Q. We
ignore the obverse edges in the breakpoint graph and focus on the black and gray edges
forming a collection of black-gray alternating cycles (Fig. 1c).

A sequence of rearrangements transforming genome P into genome Q is called a
transformation. The length of a shortest transformation using k-breaks (k = 2 or 3) is called the
k-break distance between genomes P and Q.



² From now on, we assume that the given genomes are always on the same set of genes.



[Figure 2 panels: the genomes P = (+a -b)(+c +e +d), P' = (+a -b)(+c -d -e), and Q = (+a +b -e +c -d), each shown against the target genome Q = (+a +b -e +c -d).]




Figure 2: A transformation between the genomes P and Q (defined in Fig. 1) and the corresponding 
transformation between the breakpoint graphs G(P, Q) and G(Q, Q) with a 2-break followed by a complete 
3-break. 



Any transformation of a genome P into a genome Q corresponds to a transformation
of the breakpoint graph G(P, Q) into the identity breakpoint graph G(Q, Q) (Fig. 2). A close
look at the increase in the number of black-gray cycles along this transformation allows
one to obtain a formula for the distance between genomes P and Q. Namely, the 2-break
distance is related to the number $c(P, Q)$ of black-gray cycles in G(P, Q), while the 3-break
distance is related to the number $c^{odd}(P, Q)$ of odd black-gray cycles (i.e., black-gray cycles
with an odd number of black edges):

Theorem 1 ([14]). The 2-break distance between genomes P and Q is

\[ d_2(P, Q) = |P| - c(P, Q). \]

Theorem 2 ([3]). The 3-break distance between genomes P and Q is

\[ d_3(P, Q) = \frac{|P| - c^{odd}(P, Q)}{2}. \]
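The two formulas can be checked on the genomes of Fig. 1 with a minimal Python sketch. The encoding of genomes as lists of signed gene strings and the helper names `ends`, `adjacencies`, `black_gray_cycles`, and `distances` are assumptions made for illustration; they do not come from the cited works. Here $|P|$ is taken to be the number of genes, i.e., the number of black edges.

```python
def ends(gene):
    """Return (entry, exit) extremities of a signed gene such as '+a' or '-b'."""
    sign, name = gene[0], gene[1:]
    tail, head = (name, "t"), (name, "h")
    return (tail, head) if sign == "+" else (head, tail)


def adjacencies(genome):
    """Map each gene extremity to its neighbor across an adjacency edge.

    `genome` is a list of circular chromosomes, each a list of signed genes.
    """
    match = {}
    for chrom in genome:
        for i, gene in enumerate(chrom):
            nxt = chrom[(i + 1) % len(chrom)]  # circular successor
            u, v = ends(gene)[1], ends(nxt)[0]
            match[u], match[v] = v, u
    return match


def black_gray_cycles(P, Q):
    """Return the number of black edges in each black-gray cycle of G(P, Q)."""
    black, gray = adjacencies(P), adjacencies(Q)
    seen, sizes = set(), []
    for start in black:
        if start in seen:
            continue
        v, n_black = start, 0
        while v not in seen:
            w = black[v]         # follow a black edge ...
            seen.update((v, w))
            n_black += 1
            v = gray[w]          # ... then a gray edge
        sizes.append(n_black)
    return sizes


def distances(P, Q):
    """Return (d2, d3) as given by Theorems 1 and 2."""
    sizes = black_gray_cycles(P, Q)
    n = sum(sizes)                               # |P|: number of black edges
    c = len(sizes)                               # all black-gray cycles
    c_odd = sum(1 for s in sizes if s % 2 == 1)  # odd black-gray cycles
    return n - c, (n - c_odd) // 2


P = [["+a", "-b"], ["+c", "+e", "+d"]]
Q = [["+a", "+b", "-e", "+c", "-d"]]
print(sorted(black_gray_cycles(P, Q)))  # [2, 3]
print(distances(P, Q))                  # (3, 2)
```

The value $d_3(P, Q) = 2$ agrees with the two-step transformation of Fig. 2 (a 2-break followed by a complete 3-break).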



3 Breakages and Optimal Transformations 

Alekseyev and Pevzner [2] studied the number of breakages³ in transformations. The
number of breakages made by a rearrangement is defined as the actual number of edges
changed by this rearrangement. A 2-break always makes 2 breakages, while a 3-break can
make 2 or 3 breakages. A 3-break making 3 breakages is called a complete 3-break. We treat
non-complete 3-breaks as 2-breaks.

Alekseyev and Pevzner [2] proved that between any two genomes, there always exists 
a transformation that simultaneously has the shortest length and makes the smallest 
number of breakages. We call such transformations optimal. 

For a 3-break $r$, we let $n_3(r) = 1$ if $r$ makes 3 breakages (i.e., $r$ is a complete 3-break) and
$n_3(r) = 0$ otherwise. For a transformation $t$, we further define

\[ n_2(t) = \sum_{r \in t} (1 - n_3(r)) \quad \text{and} \quad n_3(t) = \sum_{r \in t} n_3(r), \]



³ In [2], the term break is used. We use breakage to avoid confusion with k-break rearrangements.



that is, $n_2(t)$ and $n_3(t)$ are, respectively, the number of 2-breaks and complete 3-breaks
in $t$. If 2-breaks and complete 3-breaks are assigned the weights 1 and $\alpha$, respectively, then
the weight of a transformation $t$ is

\[ W_\alpha(t) = n_2(t) + \alpha \cdot n_3(t). \]

It is easy to see that a transformation $t$ has the length $n_2(t) + n_3(t) = W_1(t)$ and makes
$2 \cdot n_2(t) + 3 \cdot n_3(t) = 2 \cdot W_{3/2}(t)$ breakages overall. Therefore, a transformation is optimal
if and only if it simultaneously minimizes $W_1(t)$ and $W_{3/2}(t)$. We generalize this result in
Section 4 by showing that $3/2$ can be replaced with any $\alpha \in (1, 2)$.

For a rearrangement $r$ applied to a breakpoint graph, let $\Delta_r c^{odd}$ and $\Delta_r c^{even}$ be the
resulting increase in the number of odd and even black-gray cycles, respectively.
Clearly, $\Delta_r c^{odd} + \Delta_r c^{even} = \Delta_r c$ gives the increase in the total number of black-gray cycles.

Lemma 3. For any 3-break $r$,

• $|\Delta_r c| \le 1 + n_3(r)$;

• $\Delta_r c^{odd}$ is even and $|\Delta_r c^{odd}| \le 2$;

• $|\Delta_r c^{even}| \le 1 + n_3(r)$.

Proof. A 3-break $r$ operating on black edges in the breakpoint graph G(P, Q) destroys at
least one and at most three black-gray cycles. On the other hand, it creates at least one
and at most three new black-gray cycles. Therefore, $|\Delta_r c| \le 3 - 1 = 2$. Similarly, if $n_3(r) = 0$,
then $|\Delta_r c| \le 2 - 1 = 1$.

By similar arguments, we also have $|\Delta_r c^{odd}| \le 3$ and $|\Delta_r c^{even}| \le 3$.

Since the total number of black edges in destroyed and created black-gray cycles is the
same, $\Delta_r c^{odd}$ must be even. Combining this with $|\Delta_r c^{odd}| \le 3$, we conclude that $|\Delta_r c^{odd}| \le 2$.

If $\Delta_r c^{even} = 3$, then the destroyed cycles must be odd, implying that $\Delta_r c^{odd} = -2$.
However, it is not possible for a 3-break to destroy two cycles and create three new cycles.
Hence, $\Delta_r c^{even} \ne 3$. Similarly, $\Delta_r c^{even} \ne -3$, implying that $|\Delta_r c^{even}| \le 2$. If $n_3(r) = 0$ (i.e., $r$ is a
2-break), similar arguments imply $|\Delta_r c^{even}| \le 1$. □

Lemma 4. A transformation $t$ between two genomes is shortest if and only if $\Delta_r c^{odd} = 2$ for every
$r \in t$. Furthermore, if $t$ is a shortest transformation between two genomes, then for every $r \in t$,

• if $n_3(r) = 0$, then $\Delta_r c^{even} = -1$;

• if $n_3(r) = 1$, then $\Delta_r c^{even} = 0$ or $-2$.

Proof. A transformation $t$ of a genome P into a genome Q increases the number of odd
black-gray cycles from $c^{odd}(P, Q)$ in G(P, Q) to $c^{odd}(Q, Q) = |P|$ in G(Q, Q) with the total
increase of $|P| - c^{odd}(P, Q) = 2 \cdot d_3(P, Q)$. By Lemma 3, $\Delta_r c^{odd} \le 2$ for every $r \in t$ and thus

\[ 2 \cdot d_3(P, Q) = \sum_{r \in t} \Delta_r c^{odd} \le 2 \cdot |t|, \]

implying that $|t| = d_3(P, Q)$ (i.e., $t$ is a shortest transformation) if and only if $\Delta_r c^{odd} = 2$ for
every $r \in t$.
Figure 3: A 3-break $r$ with $\Delta_r c^{odd} = 2$ and $\Delta_r c^{even} = -2$, transforming two even black-gray cycles into two
odd black-gray cycles. Such 3-breaks may appear in shortest transformations (Lemma 4) but not in optimal
ones (Theorem 5).

Now let $t$ be a shortest transformation and thus $\Delta_r c^{odd} = 2$ for every $r \in t$. For a 2-break
$r$ to have $\Delta_r c^{odd} = 2$, it must be applied to an even black-gray cycle and split it into two odd
black-gray cycles. Thus any such $r$ also decreases the number of even black-gray cycles
by 1, i.e., $\Delta_r c^{even} = -1$.

If a complete 3-break $r$ has $\Delta_r c^{odd} = 2$, then $\Delta_r c^{even} = \Delta_r c - \Delta_r c^{odd} \le 2 - 2 = 0$. By
Lemma 3, we also have $\Delta_r c^{even} \ge -2$ and $\Delta_r c^{even} \ne -1$, implying that $\Delta_r c^{even} = 0$ or $-2$. □

By the definition, any optimal transformation is necessarily shortest. However, not 
every shortest transformation is optimal. The following theorem characterizes optimal 
transformations within the shortest transformations: 

Theorem 5. A shortest transformation $t$ between two genomes is optimal if and only if for any
$r \in t$, $\Delta_r c^{even} \ne -2$.

Proof. Let $t$ be a shortest transformation between two genomes. By Lemma 4, $n_3(t) = u + v$,
where $u$ is the number of complete 3-breaks with $\Delta_r c^{even} = 0$ and $v$ is the number of complete
3-breaks with $\Delta_r c^{even} = -2$ (Fig. 3).

With $n_2(t)$ 2-breaks and $n_3(t) = u + v$ complete 3-breaks, G(P, Q) is transformed into
G(Q, Q) with $|P| = |Q|$ trivial black-gray cycles, which are all odd. By Lemma 4, for the
increase in the number of odd and even black-gray cycles in the breakpoint graph, we
have:

\[
\begin{cases}
c^{odd}(P, Q) + 2\,(n_2(t) + u + v) = |P|, \\
c^{even}(P, Q) - n_2(t) - 2v = 0,
\end{cases}
\]

implying that

\[
W_{3/2}(t) = n_2(t) + \frac{3}{2}(u + v) = \frac{3|P| - 3c^{odd}(P, Q) - 2c^{even}(P, Q)}{4} + v,
\]

which is minimal if and only if $v = 0$, i.e., $\Delta_r c^{even} \ne -2$ for any $r \in t$. □

Lemma 4 and Theorem 5 imply:



Corollary 6. A transformation $t$ between two genomes is optimal if and only if for any $r \in t$,

• if $n_3(r) = 0$, then $\Delta_r c^{odd} = 2$ and $\Delta_r c^{even} = -1$;

• if $n_3(r) = 1$, then $\Delta_r c^{odd} = 2$ and $\Delta_r c^{even} = 0$.



Theorem 7. A transformation $t$ between genomes P and Q is optimal if and only if

\[
\begin{cases}
n_2(t) = c^{even}(P, Q), \\
n_3(t) = \dfrac{|P| - c^{odd}(P, Q)}{2} - c^{even}(P, Q).
\end{cases}
\tag{1}
\]

Proof. Let $t$ be an optimal transformation between genomes P and Q. Then with $n_2(t)$
2-breaks and $n_3(t)$ complete 3-breaks, it transforms G(P, Q) into G(Q, Q) with $|P| = |Q|$
trivial black-gray cycles, which are all odd. By Corollary 6, we have

\[
\begin{cases}
c^{odd}(P, Q) + 2\,(n_2(t) + n_3(t)) = |P|, \\
c^{even}(P, Q) - n_2(t) = 0,
\end{cases}
\]

implying formulae (1).

Vice versa, a transformation $t$ between genomes P and Q satisfying (1) has the length
$n_2(t) + n_3(t) = \frac{|P| - c^{odd}(P, Q)}{2} = d_3(P, Q)$, implying that $t$ is a shortest transformation. By Lemma 4,
$\Delta_r c^{even} = -1$ for every 2-break $r \in t$ and $\Delta_r c^{even} = 0$ or $-2$ for every complete 3-break $r \in t$.
Let $v$ be the number of complete 3-breaks $r \in t$ with $\Delta_r c^{even} = -2$. Then the increase in the
number of even black-gray cycles along $t$ is

\[ -c^{even}(P, Q) = -n_2(t) - 2v = -c^{even}(P, Q) - 2v, \]

implying that $v = 0$ and thus $t$ is optimal by Theorem 5. □

Theorem 7 implies that for some genomes, every optimal transformation consists 
entirely of complete 3-breaks: 

Corollary 8. For genomes P and Q with $c^{even}(P, Q) = 0$, every optimal transformation $t$ has
$n_2(t) = 0$ and thus consists entirely of complete 3-breaks.

Corollary 9. For an optimal transformation $t$ between genomes P and Q,

\[
W_\alpha(t) = c^{even}(P, Q) + \alpha \cdot \left( \frac{|P| - c^{odd}(P, Q)}{2} - c^{even}(P, Q) \right).
\]
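As a small numeric illustration of formulae (1) and Corollary 9, the sketch below evaluates them for the genomes of Fig. 1. The function name `optimal_profile` is hypothetical. The cycle counts $|P| = 5$, $c^{odd}(P, Q) = 1$, $c^{even}(P, Q) = 1$ are not stated in the text; they were obtained by tracing the black and gray adjacencies of P and Q by hand.

```python
from fractions import Fraction


def optimal_profile(n_genes, c_odd, c_even, alpha):
    """Return (n2, n3, weight) shared by all optimal transformations."""
    n2 = c_even                               # formulae (1), first line
    n3 = (n_genes - c_odd) // 2 - c_even      # formulae (1), second line
    return n2, n3, n2 + Fraction(alpha) * n3  # weight from Corollary 9


# Cycle counts of G(P, Q) for the genomes of Fig. 1 (derived by hand).
n2, n3, w = optimal_profile(5, 1, 1, Fraction(3, 2))
print(n2, n3, w)                 # 1 1 5/2
print(2 * n2 + 3 * n3 == 2 * w)  # True
```

One 2-break and one complete 3-break match the transformation shown in Fig. 2, and the last line checks that the number of breakages equals $2 \cdot W_{3/2}(t)$ in this example.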



4 Weighted multi-break distance 

Let $T(P, Q)$ be the set of all transformations between genomes P and Q. For a real number
$\alpha$, we define the weighted distance $D_\alpha(P, Q)$ between genomes P and Q as

\[ D_\alpha(P, Q) = \min_{t \in T(P, Q)} W_\alpha(t), \]

that is, the minimum possible weight of a transformation between P and Q.



Two important examples of the weighted distance are the "unweighted" distance
$D_1(P, Q) = d_3(P, Q)$ and the distance $D_{3/2}(P, Q)$, equal to half of the minimum number
of breakages in a transformation between genomes P and Q. By the definition of an
optimal transformation, we have $D_{3/2}(P, Q) = W_{3/2}(t_0)$, where $t_0$ is an optimal transformation
between genomes P and Q. Below we prove that $D_\alpha(P, Q) = W_\alpha(t_0)$ for any $\alpha \in (1, 2]$.

Theorem 10. For $\alpha \in (1, 2]$,

\[ D_\alpha(P, Q) = W_\alpha(t_0), \]

where $t_0$ is any optimal transformation between genomes P and Q.

Furthermore, for $\alpha \in (1, 2)$, if $D_\alpha(P, Q) = W_\alpha(t)$ for a transformation $t$ between genomes P and
Q, then $t$ is an optimal transformation.

Proof. Let $t$ be any transformation and $t_0$ be any optimal transformation between genomes
P and Q.

We classify all possible changes in the number of even and odd black-gray cycles
resulting from a single rearrangement $r$. By Lemma 3, $\Delta_r c^{odd}$ may take only values $-2$, $0$, $2$,
while $|\Delta_r c| = |\Delta_r c^{odd} + \Delta_r c^{even}| \le 1$ (if $r$ is a 2-break) or $\le 2$ (if $r$ is a complete 3-break). The
table below lists the possible values of $\Delta_r c^{odd}$ and $\Delta_r c^{even}$ satisfying these restrictions, along
with the number of rearrangements of each particular type in $t$, denoted $x_i$ for 2-breaks
and $y_j$ for complete 3-breaks.
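\[
\begin{array}{l|rrrrr}
\text{2-breaks} & x_1 & x_2 & x_3 & x_4 & x_5 \\ \hline
\Delta_r c^{odd} & 0 & 0 & 0 & -2 & 2 \\
\Delta_r c^{even} & 0 & 1 & -1 & 1 & -1
\end{array}
\]

\[
\begin{array}{l|rrrrrrrrrrr}
\text{complete 3-breaks} & y_1 & y_2 & y_3 & y_4 & y_5 & y_6 & y_7 & y_8 & y_9 & y_{10} & y_{11} \\ \hline
\Delta_r c^{odd} & 0 & 0 & 0 & 0 & 0 & 2 & 2 & 2 & -2 & -2 & -2 \\
\Delta_r c^{even} & 0 & 1 & -1 & 2 & -2 & 0 & -1 & -2 & 0 & 1 & 2
\end{array}
\]

For the transformation $t$, we have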



\[
\begin{aligned}
n_2(t) &= x_1 + x_2 + x_3 + x_4 + x_5, \\
n_3(t) &= y_1 + y_2 + y_3 + y_4 + y_5 + y_6 + y_7 + y_8 + y_9 + y_{10} + y_{11}.
\end{aligned}
\]

Calculating the total increase in the number of odd and even black-gray cycles along
$t$, we have

\[
\begin{aligned}
-2x_4 + 2x_5 + 2y_6 + 2y_7 + 2y_8 - 2y_9 - 2y_{10} - 2y_{11} &= |P| - c^{odd}(P, Q), \\
x_2 - x_3 + x_4 - x_5 + y_2 - y_3 + 2y_4 - 2y_5 - y_7 - 2y_8 + y_{10} + 2y_{11} &= -c^{even}(P, Q).
\end{aligned}
\]

Theorem 7 further implies

\[
\begin{aligned}
n_2(t_0) &= -x_2 + x_3 - x_4 + x_5 - y_2 + y_3 - 2y_4 + 2y_5 + y_7 + 2y_8 - y_{10} - 2y_{11}, \\
n_3(t_0) &= x_2 - x_3 + y_2 - y_3 + 2y_4 - 2y_5 + y_6 - y_8 - y_9 + y_{11}.
\end{aligned}
\]

Now we can evaluate the difference between the weights of $t$ and $t_0$ as follows:

\[
\begin{aligned}
W_\alpha(t) - W_\alpha(t_0) &= n_2(t) - n_2(t_0) + \alpha \cdot (n_3(t) - n_3(t_0)) \\
&= x_1 + 2x_2 + 2x_4 + y_2 - y_3 + 2y_4 - 2y_5 - y_7 - 2y_8 + y_{10} + 2y_{11} \\
&\quad + \alpha \cdot (-x_2 + x_3 + y_1 + 2y_3 - y_4 + 3y_5 + y_7 + 2y_8 + 2y_9 + y_{10}) \\
&= x_1 + (2 - \alpha) \cdot x_2 + \alpha \cdot x_3 + 2x_4 + \alpha \cdot y_1 + y_2 + (2\alpha - 1) \cdot y_3 + (2 - \alpha) \cdot y_4 \\
&\quad + (3\alpha - 2) \cdot y_5 + (\alpha - 1) \cdot y_7 + (2\alpha - 2) \cdot y_8 + 2\alpha \cdot y_9 + (\alpha + 1) \cdot y_{10} + 2 \cdot y_{11}.
\end{aligned}
\]



Since $\alpha \in (1, 2]$ and $x_i, y_j \ge 0$, all summands in the last expression are nonnegative and
thus $W_\alpha(t) - W_\alpha(t_0) \ge 0$. Since $t$ is an arbitrary transformation, we have

\[ D_\alpha(P, Q) = W_\alpha(t_0). \]

For $\alpha \in (1, 2)$, if $D_\alpha(P, Q) = W_\alpha(t)$ then $W_\alpha(t) - W_\alpha(t_0) = 0$, implying that only $x_5$ and $y_6$
(appearing with zero coefficients in the expression for $W_\alpha(t) - W_\alpha(t_0)$) can be nonzero and
thus $t$ is optimal by Corollary 6. □
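The coefficients in the last expression can be re-derived mechanically, as in the following sketch. The pairs $(\Delta_r c^{odd}, \Delta_r c^{even})$ attached to each $x_i$ and $y_j$ are read off from the two cycle-count equations in the proof; the names `TYPES`, `coefficient`, and `value` are hypothetical. Each coefficient is linear in $\alpha$, so nonnegativity at $\alpha = 1$ and $\alpha = 2$ implies nonnegativity on the whole interval, and a coefficient that vanishes at an interior point must then vanish identically.

```python
from fractions import Fraction

# (name, is_complete_3break, delta_odd, delta_even) for every rearrangement type,
# read off from the cycle-count equations in the proof of Theorem 10.
TYPES = [
    ("x1", False, 0, 0), ("x2", False, 0, 1), ("x3", False, 0, -1),
    ("x4", False, -2, 1), ("x5", False, 2, -1),
    ("y1", True, 0, 0), ("y2", True, 0, 1), ("y3", True, 0, -1),
    ("y4", True, 0, 2), ("y5", True, 0, -2), ("y6", True, 2, 0),
    ("y7", True, 2, -1), ("y8", True, 2, -2), ("y9", True, -2, 0),
    ("y10", True, -2, 1), ("y11", True, -2, 2),
]


def coefficient(is_3break, d_odd, d_even):
    """Return (const, slope): the coefficient const + slope*alpha of one
    rearrangement type in W_alpha(t) - W_alpha(t0)."""
    # The rearrangement adds its own weight (1 or alpha) to W_alpha(t).
    const, slope = (0, 1) if is_3break else (1, 0)
    # By formulae (1), it changes n2(t0) by -d_even and n3(t0) by d_odd/2 + d_even.
    const += d_even
    slope -= d_odd // 2 + d_even
    return const, slope


def value(coef, alpha):
    """Evaluate a linear coefficient (const, slope) at a given alpha."""
    const, slope = coef
    return const + slope * alpha


coefs = {name: coefficient(b, o, e) for name, b, o, e in TYPES}
nonneg = all(value(c, a) >= 0 for c in coefs.values() for a in (1, 2))
zero_inside = sorted(n for n, c in coefs.items() if value(c, Fraction(3, 2)) == 0)
print(nonneg)       # True
print(zero_inside)  # ['x5', 'y6']
```

For instance, `coefs["y5"]` is `(-2, 3)`, i.e., $3\alpha - 2$, in agreement with the expression above.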

5 Discussion 

We proved that for $\alpha \in (1, 2]$, the minimum-weight transformations include the optimal
transformations (Theorem 10) that may entirely consist of transposition-like operations
(modelled as complete 3-breaks) (Corollary 8). Therefore, the corresponding weighted
genomic distance does not actually impose any bound on the proportion of transpositions.

For $\alpha \in (1, 2)$, we proved an even stronger result: the minimum-weight
transformations coincide with the optimal transformations (Theorem 10). As a consequence,
a particular choice of $\alpha \in (1, 2)$ imposes no restrictions on the minimum-weight
transformations as compared to other values of $\alpha$ from this interval. The value $\alpha = 3/2$ then
proves that the optimal transformations coincide with those that at the same time have
the shortest length and make the smallest number of breakages, studied by Alekseyev
and Pevzner [2]. We further characterized the optimal transformations within the shortest
transformations (i.e., the minimum-weight transformations for $\alpha = 1$) by showing that the
optimal transformations avoid one particular type of rearrangements (Theorem 5, Fig. 3).

It is worth mentioning that the weighted genomic distance with $\alpha \ge 2$ is useless, since
it allows (for $\alpha = 2$) or even promotes (for $\alpha > 2$) replacement of every complete 3-break
with two equivalent 2-breaks, thus eliminating complete 3-breaks altogether.
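
To make the comparison explicit, let $t'$ denote the transformation obtained from a transformation $t$ by replacing each complete 3-break with two equivalent 2-breaks (the symbol $t'$ is introduced here only for illustration). Then $n_2(t') = n_2(t) + 2\,n_3(t)$ and $n_3(t') = 0$, so that

\[
W_\alpha(t') = n_2(t) + 2\,n_3(t) \le n_2(t) + \alpha\,n_3(t) = W_\alpha(t) \quad \text{for } \alpha \ge 2,
\]

with strict inequality whenever $\alpha > 2$ and $n_3(t) > 0$.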

The extension of our results to the case of linear genomes will be published elsewhere. 